Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack overflow fix - eliminate recursive implementation #4997

Merged
merged 1 commit into from
Jul 27, 2018

Conversation

kgururaj
Copy link
Collaborator

Fix for #4994

The recursive implementation should not have been accepted - my mistake.

@droazen
Copy link
Contributor

droazen commented Jul 12, 2018

I'd suggest doing a bug fix release @lbergelson once this fix goes in.

@kgururaj
Copy link
Collaborator Author

Tested with 10K intervals and 100 WES samples

@codecov-io
Copy link

codecov-io commented Jul 12, 2018

Codecov Report

Merging #4997 into master will increase coverage by 0.03%.
The diff coverage is 100%.

@@              Coverage Diff               @@
##              master     #4997      +/-   ##
==============================================
+ Coverage     86.367%   86.397%   +0.03%     
- Complexity     28832     28945     +113     
==============================================
  Files           1791      1791              
  Lines         133603    134348     +745     
  Branches       14920     15063     +143     
==============================================
+ Hits          115389    116073     +684     
- Misses         12810     12858      +48     
- Partials        5404      5417      +13
Impacted Files Coverage Δ Complexity Δ
...ls/genomicsdb/GenomicsDBImportIntegrationTest.java 91.089% <100%> (+0.157%) 75 <2> (+2) ⬆️
...dinstitute/hellbender/utils/R/RScriptExecutor.java 80.282% <0%> (-8.451%) 17% <0%> (-3%)
...g/broadinstitute/hellbender/utils/io/Resource.java 55.556% <0%> (-7.407%) 6% <0%> (-1%)
...llbender/tools/walkers/bqsr/AnalyzeCovariates.java 67.593% <0%> (-4.63%) 29% <0%> (-1%)
...ute/hellbender/utils/recalibration/RecalUtils.java 89.407% <0%> (-3.814%) 52% <0%> (-1%)
.../sv/integration/SVIntegrationTestDataProvider.java 90.909% <0%> (-3.209%) 2% <0%> (+1%)
...te/hellbender/tools/spark/sv/utils/SVInterval.java 85.507% <0%> (-1.993%) 64% <0%> (+28%)
...walkers/bqsr/AnalyzeCovariatesIntegrationTest.java 92.063% <0%> (-1.783%) 22% <0%> (-2%)
...on/FindBreakpointEvidenceSparkIntegrationTest.java 100% <0%> (ø) 11% <0%> (+4%) ⬆️
...spark/sv/evidence/BreakpointDensityFilterTest.java 100% <0%> (ø) 28% <0%> (+11%) ⬆️
... and 9 more

@cmnbroad
Copy link
Collaborator

@lbergelson Do you want to look at this ? I can do a release tomorrow if we can get it in.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One request for a test.

@kgururaj
Copy link
Collaborator Author

How do you want to test this? The error was triggered only if a large number of intervals (~1000) was imported by the tool.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 23, 2018

@kguraj It looks like there are existing integration tests that use intervals that cover a pretty wide genomic range. It should be easy to write a test that programmatically generates a large set of (10000) or so very small intervals (1bp) with small (1bp) gaps between them (the gaps are necessary since otherwise the intervals will be merged together by the engine) that fails without this change and passes with it. It doesn't necessarily have to verify the results, just successfully complete.

@kgururaj
Copy link
Collaborator Author

FYI: the test will take a long time to run.

Added the requested test.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 23, 2018

@kgururaj Thanks for adding the test. Running it locally on my laptop (without your fix) succeeds though - I have to bump it up from 1000 intervals to 9000 to reproduce the stack overflow. But if I do that, it takes a long time to run, since it appears to be creating lots of small partitions. Is there any way to get it to use fewer partitions in a case like this where there are lots of intervals ?

Somewhat more concerning is that when with 8000 intervals, I see a different failure mode. First I see lots (thousands) of these messages:

[GenomicsDB::VariantStorageManager] INFO: ignore message "[TileDB::StorageManager] Error: Cannot list TileDB directory; Directory buffer overflow." in the previous line

followed by a failure that ends like this:

[TileDB::StorageManager] Error: Cannot store schema; Too many open files in system. libc++abi.dylib: terminating with uncaught exception of type LoadOperatorException: LoadOperatorException : Could not define TileDB array TileDB error message : [TileDB::StorageManager] Error: Cannot store schema; Too many open files in system

Can you reproduce that ?

@kgururaj
Copy link
Collaborator Author

Is there any way to get it to use fewer partitions in a case like this where there are lots of intervals ?
No - see the PR message #4645 (comment)

The multi-interval support in GenomicsDBImport tool is purely for convenience. For scalability with a large number of intervals and samples, you should use multiple processes, each writing to a small (1?) number of intervals.

Somewhat more concerning is that when with 8000 intervals, I see a different failure mode. First I see lots (thousands) of these messages:
The first set of messages are spurious debug messages - no error in reality. I'll provide a jar without these messages.

The second set of messages are a result of too many file handles open per process - your system is limiting the number of file handles opened by a single process. Again, this goes back to the previous statement.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 24, 2018

@kgururaj Do those first messages originate here ? I'm not sure what GenomicsDB code path leads there, but it looks like TileDB considers them to be an error that results in a short-circuit return. I'd be concerned that masking them would hide some underlying error. Are there any tests that verify that data round-trips though GDB when an interval list large enough to trigger these messages is encountered ?

@kgururaj
Copy link
Collaborator Author

Yes, that's the error message. You should see this part of the GenomicsDB code that retries the function when an error is detected. So, you don't need to be concerned by those error messages. They are annoying though and I will disable them from being printed.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 25, 2018

@kguraj Thanks for the response(s). The test in the PR was super useful as a temporary test, but as you mentioned it runs pretty slowly, and as it stands the test passes on current master anyway. It seems to require on the order of 9000-1000 intervals instead rather than 1000 to actually hit stack overflow. Since that would be a very slow running test, I'm inclined to back it out.

Also, the user who originally reported the issue was using 11k intervals, and it seems that the stack overflow fix is unlikely to help in that case. Is there any guidance for users on what is a reasonable number of intervals per process ? It sounds like the intention was that it be used with pretty small intervals. Should we issue a warning message in GenomicsDBImport at some threshold number of intervals ?

Are you planning to produce a jar with the error messages suppressed for this PR?

@kgururaj
Copy link
Collaborator Author

It's your call whether you want the test or not

Yeah, it would be good to put an advisory message if the number of intervals is more than 100.

I uploaded a jar yesterday, but it's not showed up in Maven central

@kgururaj
Copy link
Collaborator Author

New jar without spurious debug messages

@cmnbroad
Copy link
Collaborator

@kgururaj Thx for the updated jar. Can you remove the test commit now, and then we can run this on travis once more and then we can merge ? Thx.

Use jar without TileDB verbose logging - error messages are printed out by GenomicsDB when required
@kgururaj
Copy link
Collaborator Author

Done

@cmnbroad
Copy link
Collaborator

@kgururaj It looks like the test is still included in the PR - I think it should come out since it doesn't use enough intervals to test the fix, and it would be too slow if it did.

@kgururaj
Copy link
Collaborator Author

It's no longer a test - just a private function. I just kept it in case somebody wishes to make it a good test in the future.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kgururaj!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants